1. INTRODUCTION

Indonesia has immense tourism potential and incredible diversity, making it attractive to both local 
and international tourists [1], [2]. In an effort to promote Indonesian tourism, the government uses the tagline 
"wonderful Indonesia" as the identity of Indonesian tourism [3], [4]. In the current digital era, Twitter has 
become one of the popular platforms for various information and opinions. Based on the We Are Social 
report in Figure 1, the number of Twitter users in Indonesia reached 18.45 million in 2022 [5]. This number 
is equivalent to 4.23% of the total Twitter users worldwide, which reached 436 million. These figures place 
Indonesia as the fifth-largest country in terms of Twitter users globally. Therefore, analyzing the sentiment of 
Twitter users towards Indonesian tourism using the keyword "wonderful Indonesia" becomes crucial to 
understand public perspectives on Indonesian tourism. 

In this article, the K-nearest neighbor (KNN) method will be used to analyze the sentiment of 
Twitter users [6]. The aim of this analysis is to determine the level of satisfaction and perspectives of Twitter 
users towards Indonesian tourism [7]. Syarifuddin [8] studied public opinion on Twitter regarding the
government's large-scale social restrictions policy (pembatasan sosial berskala besar, PSBB), a policy that
restricts movement and community activities and is commonly known as lockdown in many countries. KNN,
the algorithm used in this study, was also one of the three algorithms employed by Syarifuddin, alongside
decision tree and naive bayes, with the aim of finding the best accuracy in the prediction process. Among the
three algorithms, the decision tree algorithm yielded the best results with an accuracy of 83.3%, precision of
79%, and recall of 87.17%.

On the other hand, in a research study conducted on sentiment analysis towards the reopening of 
tourist destinations amidst the COVID-19 pandemic, the naive bayes algorithm and the KNN algorithm were 
utilized to classify tweet data as positive or negative [9]. The research findings revealed that the naive bayes 
algorithm achieved the highest accuracy rate of 75.53%, with a positive precision of 71% and a positive 
recall of 99%. Meanwhile, the KNN algorithm obtained the highest accuracy rate of 48.66%, with a positive 
precision of 69%, and a positive recall of 69%. 

The novelty of this research is a deeper understanding of public sentiment in relation to tourism 
promotion efforts in Indonesia through the "wonderful Indonesia" campaign, supported by social media data 
analysis. Along with the development of technology, the important role of social media in shaping public 
opinions and views is becoming increasingly apparent. Therefore, this article discusses how the KNN
method is used in analyzing Twitter user sentiment towards Indonesian tourism using the keyword 
"wonderful Indonesia" from January 2021 to November 2022. We will delve into the data collection process 
from Twitter and how the KNN method is implemented in sentiment analysis. Additionally, this research can 
provide insights into how public feelings and intentions regarding tourism can be influenced by social media 
and aid in making strategic decisions for the tourism industry [10]. It can also offer information on traveler 
trends and preferences, thus assisting in tourism product planning and development. 
Figure 1. Indonesian Twitter users (2019-2022)

2. METHOD 


The method used is the KNN method. As depicted in Figure 2, the KNN method broadly consists of 
the following steps: scraping, preprocessing, sentiment data, data splitting, training data, testing data, and 
data visualization [11]. The KNN method was chosen because the stages in system development using the 
KNN method are considered to be clearly structured. 

Apart from the algorithm, the use of larger and more representative datasets is also a key factor in 
achieving better results. Larger datasets give the model more examples to learn and adapt to a wider variety 
of language and user communication styles. This allows the model to produce a more accurate and 
generalized representation of the sentiments expressed in tweets. In addition, new techniques in natural 
language processing are also applied to derive more informative features from the tweet text, ultimately 
improving the quality of sentiment prediction. 
Figure 2. Road map of the KNN method employed


3. RESULTS AND DISCUSSION 
3.1. Scraping data 

Data scraping is an automation technique used to extract data from websites, databases, enterprise 
applications, or legacy systems, which can then be saved in a tabular or spreadsheet format [12]. In this 
study, data scraping was performed on Twitter using Google Colaboratory and the Python
programming language [13]. The retrieved tweets were then stored in a CSV file, for a total of
16,543 tweets. This stage was crucial in obtaining the data needed for the sentiment analysis of Twitter users
towards Indonesian tourism using the keyword "wonderful Indonesia". Table 1 shows a sample of the data
collected in the data scraping stage.


Table 1. Data scraping results
Datetime | Tweet | Username
2021-01-30 23:59:29 | Last Moments of the Month. #waterfall #Sumaterabarat #wonderfulindonesia #November (In Indonesian) | afrinaldomirfen
2021-01-30 16:19:36 | Tourism Greetings! Wonderful Indonesia! Be enthusiastic about taking part, empowered for change! Contact Person: 081213313516 (Mila) Thank you #KabinetDayaJuang #BEMKMUBJ2022 Ministry of Tourism and Creative Economy (HALMAS Coordinating Ministry) BEM KM UBJ 2022-2023 (In Indonesian) | bemkm_ubj
2021-01-30 14:56:10 | Hurry up and save it and mark it on your calendar, friend! #WonderfulIndonesia (In Indonesian) | pesonaindonesia
2021-01-30 14:50:38 | Love catcher in Bali is fun too. Not bad, Aqua and Wonderful Indonesia are very popular (In Indonesian) | _fleurdevella
2021-01-30 14:49:36 | BUMN Minister Erick Thohir: Mutual Cooperation is the Strength of the Indonesian Nation, the Heritage of Our Ancestors. And we must preserve it. .. Wonderful Indonesia...! @erickthohir #RiseTogetherET (In Indonesian) | CryptoKuta


3.2. Preprocessing 

Preprocessing is the stage where the obtained tweets are cleaned by removing duplicate tweets, RT, 
#, @, numbers, emoticons, and other symbols [14]. This stage involves several processes, including data 
cleansing, tokenization, removal of stopwords, and weighting [15]. The commands used in preprocessing are 
designed to simplify the application to the data used. 


3.2.1. Cleansing 
The cleansing stage involves cleaning the data obtained in the sentiment analysis process. The goal 
is to remove noise and increase the validity of the data before performing sentiment analysis [16]. This stage 
includes several processes as follows: 
— data = data.drop_duplicates(subset=['']) to remove duplicate data, that is, entries that have the same
sentence. As a result, the data is reduced from 16,543 entries to 14,189 entries.
— df = data.reset_index(drop=True) to renumber the index after the duplicates are dropped.
— Tweet = re.sub(r'@[A-Za-z0-9]+', '', Tweet) to remove mentions (@ followed by a username).
— Tweet = re.sub(r'#', '', Tweet) to remove the hashtag symbol (#).
— Tweet = re.sub(r'RT[\s]+', '', Tweet) to remove the retweet marker (RT).
— Tweet = re.sub(r'https?:\/\/\S+', '', Tweet) to remove hyperlinks (URLs).
— Tweet = re.sub(r'www\.\S+', '', Tweet) to remove website links.
— Tweet = re.sub(r'[^A-Za-z ]', '', Tweet) to remove characters other than the letters A-Z and spaces.
— Tweet = re.sub(r'(\d)', '', Tweet) to remove digit characters from the tweet string.
— Tweet = re.sub(r'\t', '', Tweet) to remove tab characters from the tweet string.
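The substitutions above can be combined into a single helper. The following is a minimal sketch, not the authors' exact code: the function name clean_tweet, the sample tweet, and its URL are made up for illustration. It also removes URLs before stripping symbols (so that links are still recognizable when matched), allows underscores in usernames, and collapses whitespace at the end, all of which are assumptions that go beyond the list above.

```python
import re

def clean_tweet(tweet: str) -> str:
    """Apply the cleansing substitutions to one tweet (illustrative order)."""
    tweet = re.sub(r"https?://\S+", "", tweet)    # hyperlinks
    tweet = re.sub(r"www\.\S+", "", tweet)        # bare website links
    tweet = re.sub(r"\bRT\s+", "", tweet)         # retweet marker
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)  # mentions
    tweet = re.sub(r"#", "", tweet)               # hash sign only, tag text is kept
    tweet = re.sub(r"[^A-Za-z\s]", "", tweet)     # digits, emoticons, other symbols
    tweet = re.sub(r"\s+", " ", tweet)            # tabs and repeated spaces
    return tweet.strip()

# Hypothetical sample tweet, for illustration only.
sample = "RT @erickthohir: Wonderful Indonesia 2022! #Bali\thttps://t.co/abc123"
print(clean_tweet(sample))  # Wonderful Indonesia Bali
```

In a pandas workflow, such a helper would typically be applied to the tweet column with apply, one row at a time.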


3.2.2. Tokenizing

Tokenization is an important step in text processing that involves breaking a text or sentence into 
smaller parts called tokens [17]. Each token is usually the smallest unit of language, such as a word or 
punctuation mark. The tokenization process aims to break text into elements that are easier for computers to 
manage and understand. This process is done with the function: 
from nltk.tokenize import word_tokenize
data['tokens'] = data['text'].apply(lambda x: word_tokenize(x))


3.2.3. Remove stopwords 
Stopword removal is a crucial step in text processing that aims to remove words that tend not to contribute
significant meaning in a given language [18]. In Indonesian, this process often relies on stopword lists from
language-processing libraries to identify and remove words that are considered common and do not provide
deep meaning in the context of a sentence or document. Slang words are then normalized using a slang
dictionary.
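# stop_words (a collection of stopwords) and slang_dict (slang -> standard form) are assumed to be defined earlier
def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stopword list
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens

def convert_slangword(tokens):
    # Replace a slang word with its standard form when a mapping exists
    normalized_words = [slang_dict[word] if word in slang_dict else word for word in tokens]
    return normalized_words

data['stemmed_tokens'] = data['tokens'].apply(remove_stop_words)
data['lemmatized_tokens'] = data['stemmed_tokens'].apply(convert_slangword)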
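
The two functions above depend on a stopword list and a slang dictionary that the source does not show. The sketch below runs the same logic on a single token list; the contents of stop_words and slang_dict and the sample tokens are small hypothetical stand-ins, not the resources used in the study.

```python
# Hypothetical, tiny stand-ins for the real stopword list and slang dictionary.
stop_words = {"yang", "di", "dan", "ini"}
slang_dict = {"bgt": "banget", "gak": "tidak"}

def remove_stop_words(tokens):
    # Drop tokens that appear in the stopword set.
    return [t for t in tokens if t not in stop_words]

def convert_slangword(tokens):
    # Map slang to its standard form, leaving other tokens unchanged.
    return [slang_dict.get(t, t) for t in tokens]

tokens = ["pantai", "ini", "bagus", "bgt", "dan", "gak", "ramai"]
result = convert_slangword(remove_stop_words(tokens))
print(result)  # ['pantai', 'bagus', 'banget', 'tidak', 'ramai']
```

The order follows the snippet above: stopwords are removed first and slang is normalized afterwards, so a slang word whose standard form is itself a stopword would survive the filtering.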


3.2.4. Weighting 

Weighting is an important process in text analysis that aims to assign weight or importance to words 
in a document or corpus based on their relative frequency or significance [19]. In text analysis, not all words 
contribute equally to the meaning or information contained in the text. Therefore, weighting helps distinguish 
frequently occurring (common) words from infrequently occurring (rare) words, thus creating a more 
accurate representation of the text. In this stage, the get_word_weights function is used.
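
The source does not show how the weights are computed. As an illustration only, the sketch below uses TF-IDF, one common weighting scheme that rewards words frequent in a document but rare across the corpus; the function name tf_idf and the sample documents are assumptions and may differ from the weighting actually used in the study.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # term frequency in this document times inverse document frequency
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Made-up token lists, for illustration only.
docs = [
    ["wonderful", "indonesia", "bali"],
    ["wonderful", "indonesia", "mandalika"],
    ["wonderful", "lombok"],
]
w = tf_idf(docs)
print(w[0])
```

In this toy corpus "wonderful" occurs in every document and therefore receives a weight of zero, while "bali", which occurs in only one document, receives the largest weight in the first document.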


3.3. Data labeling 

Labeling is the stage in which each tweet is assigned a sentiment label [20]. Labeling is done using the
TextBlob library, which provides an implementation of a sentiment analysis model. Calling
'TextBlob(text).sentiment.polarity' returns a polarity value between -1 and 1, where negative values indicate
negative sentiment, 0 indicates neutral sentiment, and positive values indicate positive sentiment [21]. At this
stage, 156 tweets were labeled negative, 9,242 neutral, and 4,791 positive. Table 2 displays a sample of the
labeled data. The polarity value is mapped to a label with the following function:
def getAnalysis(score):
    # Map a TextBlob polarity score to a sentiment label
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'


Table 2. Result of data labeling
Datetime | Tweet | Username | Sentiment
2021-01-30 23:59:29 | Last Moments of the Month. #waterfall #Sumaterabarat #wonderfulindonesia #November (In Indonesian) | afrinaldomirfen | Neutral
2021-01-30 16:19:36 | Tourism Greetings! Wonderful Indonesia! Be enthusiastic about taking part, empowered for change! Contact Person: 081213313516 (Mila) Thank you #KabinetDayaJuang #BEMKMUBJ2022 Ministry of Tourism and Creative Economy (HALMAS Coordinating Ministry) BEM KM UBJ 2022-2023 (In Indonesian) | bemkm_ubj | Positive
2021-01-30 14:56:10 | Hurry up and save it and mark it on your calendar, friend! #WonderfulIndonesia | pesonaindonesia | Neutral
2021-01-30 14:50:38 | Love catcher in Bali is fun too. Not bad, Aqua and Wonderful Indonesia are very popular (In Indonesian) | _fleurdevella | Positive
2021-01-30 14:49:36 | BUMN Minister Erick Thohir: Mutual Cooperation is the Strength of the Indonesian Nation, the Heritage of Our Ancestors. And we must preserve it... Wonderful Indonesia...! @erickthohir #RiseTogetherET | CryptoKuta | Neutral


3.4. Splitting data

Data splitting divides the dataset into separate parts for model training and evaluation [22]. Here the
dataset is divided into two parts, namely the training part and the testing part [23]. Training data is used to
train the model and test data is used to evaluate the model. Data splitting is done with a ratio of 75:25
(test_size = 0.25), so that the model is evaluated on data it has not seen during training, which helps reveal
overfitting. This process is done with the help of the scikit-learn library:
from sklearn.model_selection import train_test_split
X = data[['Subjectivity', 'Polarity']]
y = data['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
The data is thus split into 10,641 training tweets and 3,548 test tweets.
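
The reported split sizes follow from the test_size = 0.25 argument in the call above and the 14,189 tweets left after duplicate removal. A short check of the arithmetic, assuming the usual scikit-learn behavior of rounding the test share up to a whole sample:

```python
import math

n_total = 14189    # tweets remaining after duplicate removal
test_size = 0.25   # value passed to train_test_split

n_test = math.ceil(n_total * test_size)  # test share is rounded up
n_train = n_total - n_test
print(n_train, n_test)  # 10641 3548
```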


3.5. Training data 

Training data is integral to the formation and development of algorithms in machine learning [24]. 
The concept involves using a subset of the overall dataset collected to train a model or algorithm so that it 
can understand the patterns and relationships among the variables. In the context of machine learning, 
training data is the basis for teaching the model how to make accurate predictions or decisions based on the 
information at hand. Training uses the KNeighborsClassifier class from the scikit-learn library with K = 3
neighbors:
from sklearn.neighbors import KNeighborsClassifier
K = 3
model = KNeighborsClassifier(n_neighbors=K)
model.fit(X_train, y_train)
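
To make the classification rule concrete, the following is a minimal from-scratch sketch of KNN with K = 3 and Euclidean distance. It is not the scikit-learn implementation used in the study; the function name knn_predict and the (subjectivity, polarity) points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (subjectivity, polarity) pairs, for illustration only.
train_X = [(0.0, 0.0), (0.1, 0.0), (0.6, 0.5), (0.8, 0.7), (0.9, -0.6), (0.7, -0.4)]
train_y = ["Neutral", "Neutral", "Positive", "Positive", "Negative", "Negative"]

print(knn_predict(train_X, train_y, (0.7, 0.6)))   # Positive
print(knn_predict(train_X, train_y, (0.05, 0.0)))  # Neutral
```

Because the vote is taken among only three neighbors, a class with very few training examples can easily be outvoted by nearby points from larger classes.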


3.6. Test data 

Test data is a portion of the dataset used to evaluate the performance of the model built with the
training data [25]. Test data does not participate in the model training process. The test data is held in
'X_test' and 'y_test', which contain the test features and the corresponding sentiment labels. The
command 'model.predict(X_test)' is then used to make predictions, and the model's accuracy is evaluated by
comparing the predicted results with the actual labels using the command 'accuracy_score(y_test, y_pred)'.

Prediction, in this context, refers to the ability to predict the class or sentiment label of unknown 
sentiment texts. In sentiment analysis, a model is developed and trained using labeled training data, enabling 
it to learn patterns and trends that occur in texts with known sentiment. Once trained, the model can be used 
to predict the sentiment of new, unlabeled texts. After performing the test data, predictions can be made and 
compared with the actual data using a confusion matrix for better understanding. 
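from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
class_names = ['Negative', 'Neutral', 'Positive']
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')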

To facilitate readability of the matrix, the sentiment labels -1, 0, and 1 found in model.classes_ are 
replaced with "negative," "neutral," and "positive" respectively. The results of the above command can be 
seen in Figure 3. 

— Correctly predicted negative label: 0 data
— Negative label predicted as neutral: 9 data
— Negative label predicted as positive: 33 data
— Neutral label predicted as negative: 0 data
— Correctly predicted neutral label: 2279 data
— Neutral label predicted as positive: 19 data
— Positive label predicted as negative: 0 data
— Positive label predicted as neutral: 0 data
— Correctly predicted positive label: 1208 data

Based on the above data, the positive label is predicted most accurately: all 1,208 positive tweets in the
test set are classified correctly. The neutral label has 19 errors, and none of the 42 negative tweets is
classified correctly. After that, evaluation scores are calculated using scikit-learn. The resulting values are:

Accuracy: 0.9828072153325818
Precision: 0.9715633303499707
Recall: 0.9828072153325818
F1-score: 0.977034183850208
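
These scores can be reproduced from the confusion matrix counts listed above. The sketch below assumes support-weighted averaging over the three classes, which is the averaging that matches the reported values; the variable names are illustrative.

```python
# Rows are true labels, columns are predicted labels,
# both in the order Negative, Neutral, Positive (counts as reported above).
cm = [
    [0,    9,   33],
    [0, 2279,   19],
    [0,    0, 1208],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

weighted_precision = weighted_recall = weighted_f1 = 0.0
for i in range(3):
    support = sum(cm[i])                   # true samples of class i
    predicted = sum(row[i] for row in cm)  # samples predicted as class i
    precision = cm[i][i] / predicted if predicted else 0.0
    recall = cm[i][i] / support if support else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Weight each per-class score by that class's share of the test set.
    weighted_precision += precision * support / total
    weighted_recall += recall * support / total
    weighted_f1 += f1 * support / total

print(round(accuracy, 4), round(weighted_precision, 4),
      round(weighted_recall, 4), round(weighted_f1, 4))
```

With this averaging, recall is always equal to accuracy, and the negative class (42 of 3,548 test tweets) contributes very little to any of the scores even though none of its tweets is classified correctly.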

The built model shows excellent results in predicting sentiment on the test data. With an accuracy of
98.28%, the model correctly predicts almost all of the data used in the testing phase. The precision
of 97.16% indicates that, averaged over the three classes, the large majority of the model's predictions for a
class are correct. The recall of 98.28% indicates that, on average, the model identifies almost all instances of
each class; because it equals the accuracy, it appears to be a support-weighted average, which is dominated by
the neutral and positive classes. The F1-score of 0.977 demonstrates a good balance between precision and
recall. With consistently high aggregate metrics, it can be concluded that this model is effective in predicting
sentiment on the test data, with the exception of the small negative class.

Compared with the previously mentioned research results, this study achieves higher accuracy in
predicting sentiment on test data, although the datasets differ. Research by Syarifuddin [8] on public opinion
regarding PSBB policies obtained a best accuracy of 83.3% with the decision tree algorithm. Research by
Era et al. [9] on sentiment towards the reopening of tourist destinations using the naive bayes and KNN
algorithms obtained an accuracy of 75.53% for naive bayes and 48.66% for KNN.
Figure 3. Confusion matrix


3.7. Visualization 

Visualization is a key step in data analysis that aims to present information visually through graphs, 
diagrams, plots, or word cloud. The main goal of visualization is to transform complex data into a more 
intuitive and understandable representation [26]. Through the use of different types of visualizations, such as 
bar charts, pie charts, scatter plots, or heat maps, scattered and complex data can be interpreted more clearly 
and effectively. 


3.7.1. Visualization of each month 

In Figure 4, November has the highest number of positive tweets. This is likely because more data is
available for November than for other months. August has more negative sentiment than other months, but
across January to December the number of negative tweets is much smaller than the number of positive
tweets. This shows that the sentiment of Twitter users towards Indonesian tourism is predominantly positive
(a good response to Indonesian tourism).
Figure 4. Visualization of the number of sentiments per month
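
The counts behind a per-month chart of this kind can be obtained by grouping the labeled tweets by month and sentiment. The sketch below is one possible way to do this with the standard library; the first two rows reuse timestamps and labels from Table 2, while the remaining rows are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# (datetime, sentiment) rows in the timestamp format used in Table 2.
rows = [
    ("2021-01-30 23:59:29", "Neutral"),
    ("2021-01-30 16:19:36", "Positive"),
    ("2021-08-12 09:00:00", "Negative"),   # invented row
    ("2021-11-03 10:15:00", "Positive"),   # invented row
    ("2021-11-20 18:45:00", "Positive"),   # invented row
]

per_month = Counter()
for stamp, sentiment in rows:
    # Group by calendar month number (1 = January, ..., 12 = December).
    month = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").month
    per_month[(month, sentiment)] += 1

print(per_month[(11, "Positive")])  # 2
```

Grouping by month number alone merges the same month from different years, which is an assumption about how the figure was built.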


3.7.2. Visualization of sentiment percentage 
The sentiment percentage is visualized to show the overall distribution of opinion in tweets about
Indonesian tourism. Figure 5 shows the percentage of each sentiment class among tweets containing the
keyword "wonderful Indonesia". Negative sentiment has the smallest percentage, well below positive and
neutral sentiment.
Figure 5. Sentiment percentage from January 2021 to November 2022
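
The percentages can be derived directly from the label counts reported in the data labeling stage (156 negative, 9,242 neutral, 4,791 positive). A short check of the arithmetic, rounded to one decimal place:

```python
counts = {"Negative": 156, "Neutral": 9242, "Positive": 4791}
total = sum(counts.values())

# Share of each sentiment class, in percent.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, percentages)
# 14189 {'Negative': 1.1, 'Neutral': 65.1, 'Positive': 33.8}
```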


3.7.3. Word cloud visualization 

Word cloud visualization is a graphical representation of a collection of words, where the size of 
each word corresponds to its frequency [27]. In this visualization, the most frequently occurring words in the 
dataset are displayed in larger sizes, while less common words are displayed in smaller sizes. Based on the 
word cloud in Figure 6, it can be concluded that the words "wonderful," "Indonesia," and "Mandalika" are 
frequently discussed topics. This is because the data obtained consists mostly of tweets about promoting 
tourism in Indonesia, which have a neutral sentiment. During that time, there was also a lot of discussion 
about Mandalika as a new tourist attraction in Indonesia. 
Figure 6. WordCloud visualization
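
A word cloud is driven by word frequencies, so the ranking it displays can be inspected with a simple count. The sketch below uses made-up token lists in place of the preprocessed tweets and is meant only to show the counting step, not the plotting.

```python
from collections import Counter

# Made-up token lists standing in for the preprocessed tweets.
tweets = [
    ["wonderful", "indonesia", "mandalika"],
    ["wonderful", "indonesia", "bali"],
    ["mandalika", "wonderful", "indonesia"],
]

# Count every token across all tweets.
frequencies = Counter(token for tokens in tweets for token in tokens)
print(frequencies.most_common(3))
# [('wonderful', 3), ('indonesia', 3), ('mandalika', 2)]
```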


3.7.4. Table visualization 

Table visualization is a graphical representation of data where the data is organized in columns and 
rows. Table visualizations allow researchers to clearly display relevant data [28]. The purpose of table 
visualization is to present data in a structured format. Table 3 shows tweets with negative sentiment.
Table 3. Tweets with negative sentiment
Date | Tweet | Username | Tweet
2021-04-19 | Pink Beach, East Lombok, road access is bad, not really, the dirt road is rocky and has holes, making you lazy about planning to go on holiday there (In Indonesian) #InIndonesiaAja #WonderfulIndonesia #Lombok #LombokIsland #PulauLombok #WisatadiLombok #TripLombok #LombokTour #PantaiPinik #Pink #PinkBeach https://t.co/LILUMSPSNR | BLabuanbajo | Pink Beach, East Lombok, poor road access, not really, rocky dirt roads and potholes, making it lazy to plan a vacation there
2021-04-13 | Directions on Mount Agung are quite difficult and very prone to misleading climbers when the weather starts to get foggy. #WonderfulIndonesia (In Indonesian) | dombacoklat | The directions on Mount Agung are quite difficult and very prone to misleading climbers when the weather starts to fog up.
2021-03-18 | How come there are hotels whose rooms are cramped, damp, have no air ventilation but are still operating and charge high prices, really disappointing (In Indonesian) #Kemenparekraf #WonderfulIndonesia | indiraysm | How come there are hotels with cramped, humid rooms, no air ventilation but still operating and charging expensive prices, really disappointing.
2021-02-26 | @ayam_kinantan_reds how are you gbla, there's just a match starting to get a lot of cracks in the walls and the toilets are dirty, especially now it's really bad for sure (In Indonesian) #WonderfulIndonesia #forfootballcultureinindonesia | bubudebo | how are you doing gbla, there is a match just started a lot of cracks in the wall and the toilet is dirty, what's worse now is definitely
2021-02-18 | Andi Mattalatta Mattoanging Stadium has the unfortunate fate, abandoned, unused, not surprisingly wonderful Indonesia (In Indonesian) | ayam_kinantan_reds | andi mattalatta matoanging stadium poor fate, abandoned, not repaired, no wonder wonderful indonesia


In the effort to improve management related to negative sentiment towards tourism in Indonesia, 
there are several suggestions that can be implemented. Firstly, there is a need for infrastructure improvement 
by repairing poor road access, such as better paving and maintenance. This action will enhance the comfort 
and safety of tourists when visiting tourist destinations. Furthermore, there is a need for an enhancement in 
the quality of services in the tourism sector, especially in hotels. Better training for hotel employees, 
maintenance improvements, and a focus on customer satisfaction will help improve guest experiences and 
reduce negative sentiment related to poor service. 

In addition, efforts are also needed to improve cleanliness and environmental management in 
tourism. This refers to complaints related to beach cleanliness and waste management. Improving waste 
management, environmental awareness campaigns, and involving community participation in maintaining 
cleanliness can help reduce negative sentiment related to environmental cleanliness in tourism. In the context 
of consumer protection, it is important to enhance transparency in tourism transactions and protect consumers 
from fraudulent practices or unfair pricing. Clear regulations, strong law enforcement, and education for 
tourists are necessary to build trust and reduce negative sentiment related to consumer protection. 

Lastly, there is a need to improve coordination and communication among stakeholders such as 
tourism authorities, destination managers, and local communities. Accurate information, clear directions, and 
open communication will help reduce confusion and enhance the tourists’ experience, as well as address 
complaints related to directions and destination conditions. By implementing these suggestions, it is hoped 
that tourism management can be enhanced and negative sentiment experienced by tourists can be reduced, 
thus providing a more positive experience and improving perceptions of tourism in Indonesia. 


4. CONCLUSION 

Based on the results of the research conducted, it is concluded that the majority of Twitter users’ 
tweets contain neutral sentiments related to tourism in Indonesia. However, if neutral sentiments are ignored, 
Twitter users’ views on tourism tend to be positive. It can be seen that the percentage of positive sentiment 
reaches 33.8%, while negative sentiment is only 1.1%. This trend is reflected in key words such as 
"wonderful", "Indonesia", and "Mandalika" that frequently appear in discussion topics. This connection can 
be seen from the attention given to Mandalika as one of the main tourist destinations in Indonesia in those 
years. The analysis using the KNN algorithm achieved an accuracy of 98.3%, precision of 97.2%, recall of
98.3%, and F1-score of 97.7%, reflecting a highly accurate evaluation that can support policy decisions.
For future research, it is recommended to conduct comparisons with other methods and algorithms to test and strengthen the
research findings. By adopting a variety of approaches, researchers can gain a broader understanding and 
increase the reliability of the findings. This approach will allow future research to provide a more 
comprehensive and in-depth insight into Twitter users' sentiment towards tourism in Indonesia. 